{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Fuzzy Citation Query Engine\n",
    "\n",
    "This notebook walks through using the `FuzzyCitationEnginePack`, which can wrap any existing query engine and post-process the response object to include direct sentence citations, identified using fuzzy-matching."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Setup"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "import os\n",
    "\n",
    "os.environ[\"OPENAI_API_KEY\"] = \"sk-...\""
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "!mkdir -p 'data/'\n",
    "!curl 'https://arxiv.org/pdf/2307.09288.pdf' -o 'data/llama2.pdf'"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "!pip install unstructured[pdf]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "from llama_index import VectorStoreIndex"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "[nltk_data] Downloading package punkt to /home/loganm/nltk_data...\n",
      "[nltk_data]   Package punkt is already up-to-date!\n",
      "[nltk_data] Downloading package averaged_perceptron_tagger to\n",
      "[nltk_data]     /home/loganm/nltk_data...\n",
      "[nltk_data]   Package averaged_perceptron_tagger is already up-to-\n",
      "[nltk_data]       date!\n",
      "WARNING: CPU random generator seem to be failing, disabling hardware random number generation\n",
      "WARNING: RDRND generated: 0xffffffff 0xffffffff 0xffffffff 0xffffffff\n",
      "/home/loganm/.cache/pypoetry/virtualenvs/llama-hub-86aDfznI-py3.11/lib/python3.11/site-packages/tqdm/auto.py:21: TqdmWarning: IProgress not found. Please update jupyter and ipywidgets. See https://ipywidgets.readthedocs.io/en/stable/user_install.html\n",
      "  from .autonotebook import tqdm as notebook_tqdm\n"
     ]
    }
   ],
   "source": [
    "from llama_hub.file.unstructured import UnstructuredReader\n",
    "\n",
    "documents = UnstructuredReader().load_data(\"data/llama2.pdf\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [],
   "source": [
    "index = VectorStoreIndex.from_documents(documents)\n",
    "query_engine = index.as_query_engine()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Run the FuzzyCitationEnginePack\n",
    "\n",
    "The `FuzzyCitationEnginePack` can wrap any existing query engine."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [],
   "source": [
    "from llama_index.llama_pack import download_llama_pack\n",
    "\n",
    "FuzzyCitationEnginePack = download_llama_pack(\"FuzzyCitationEnginePack\", \"./fuzzy_pack\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {},
   "outputs": [],
   "source": [
    "fuzzy_engine_pack = FuzzyCitationEnginePack(query_engine, threshold=50)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [],
   "source": [
    "response = fuzzy_engine_pack.run(\"How was Llama2 pretrained?\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Llama 2 was pretrained using an optimized auto-regressive transformer. The pretraining approach involved robust data cleaning, updating the data mixes, training on 40% more total tokens, doubling the context length, and using grouped-query attention (GQA) to improve inference scalability for larger models. The training corpus included a new mix of data from publicly available sources, excluding data from Meta's products or services. The pretraining methodology and training details are described in more detail in the provided context.\n"
     ]
    }
   ],
   "source": [
    "print(str(response))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Compare response to citation sentences"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Response Sentence:\n",
      " Llama 2 was pretrained using an optimized auto-regressive transformer. \n",
      "\n",
      "Relevant Node Chunk:\n",
      " Llama 2-Chat, a fine-tuned version of Llama 2 that is optimized for dialogue use cases. \n",
      "----------------\n",
      "Response Sentence:\n",
      " Llama 2 was pretrained using an optimized auto-regressive transformer. \n",
      "\n",
      "Relevant Node Chunk:\n",
      " (2023), using an optimized auto-regressive transformer, but made several changes to improve performance. \n",
      "----------------\n",
      "Response Sentence:\n",
      " The pretraining approach involved robust data cleaning, updating the data mixes, training on 40% more total tokens, doubling the context length, and using grouped-query attention (GQA) to improve inference scalability for larger models. \n",
      "\n",
      "Relevant Node Chunk:\n",
      " We also increased the size of the pretraining corpus by 40%, doubled the context length of the model, and adopted grouped-query attention (Ainslie et al., 2023). \n",
      "----------------\n",
      "Response Sentence:\n",
      " The pretraining approach involved robust data cleaning, updating the data mixes, training on 40% more total tokens, doubling the context length, and using grouped-query attention (GQA) to improve inference scalability for larger models. \n",
      "\n",
      "Relevant Node Chunk:\n",
      " Specifically, we performed more robust data cleaning, updated our data mixes, trained on 40% more total tokens, doubled the context length, and used grouped-query attention (GQA) to improve inference scalability for our larger models. \n",
      "----------------\n",
      "Response Sentence:\n",
      " The training corpus included a new mix of data from publicly available sources, excluding data from Meta's products or services. \n",
      "\n",
      "Relevant Node Chunk:\n",
      " 2.1 Pretraining Data\n",
      "\n",
      "Our training corpus includes a new mix of data from publicly available sources, which does not include data from Meta’s products or services. \n",
      "----------------\n"
     ]
    }
   ],
   "source": [
    "for response_sentence, node_chunk in response.metadata.keys():\n",
    "    print(\"Response Sentence:\\n\", response_sentence)\n",
    "    print(\"\\nRelevant Node Chunk:\\n\", node_chunk)\n",
    "    print(\"----------------\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "So if we compare the original LLM output:\n",
    "\n",
    "```\n",
    "Llama 2 was pretrained using an optimized auto-regressive transformer. The pretraining approach involved robust data cleaning, updating the data mixes, training on 40% more total tokens, doubling the context length, and using grouped-query attention (GQA) to improve inference scalability for larger models. The training corpus included a new mix of data from publicly available sources, excluding data from Meta's products or services. The pretraining methodology and training details are described in more detail in the provided context.\n",
    "```\n",
    "\n",
    "With the generated fuzzy matches above, we can clearly see where each sentence came from!"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### [Advanced] Inspect citation metadata\n",
    "\n",
    "Using the citation metadata, we can get the exact character location of the response from the original document!"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Llama 2-Chat, a fine-tuned version of Llama 2 that is optimized for dialogue use cases. \n",
      "{'filename': 'data/llama2.pdf'}\n",
      "----------------\n",
      "(2023), using an optimized auto-regressive transformer, but made several changes to improve performance. \n",
      "{'filename': 'data/llama2.pdf'}\n",
      "----------------\n",
      "We also increased the size of the pretraining corpus by 40%, doubled the context length of the model, and adopted grouped-query attention (Ainslie et al., 2023). \n",
      "{'filename': 'data/llama2.pdf'}\n",
      "----------------\n",
      "Specifically, we performed more robust data cleaning, updated our data mixes, trained on 40% more total tokens, doubled the context length, and used grouped-query attention (GQA) to improve inference scalability for our larger models. \n",
      "{'filename': 'data/llama2.pdf'}\n",
      "----------------\n",
      "2.1 Pretraining Data\n",
      "\n",
      "Our training corpus includes a new mix of data from publicly available sources, which does not include data from Meta’s products or services. \n",
      "{'filename': 'data/llama2.pdf'}\n",
      "----------------\n"
     ]
    }
   ],
   "source": [
    "for chunk_info in response.metadata.values():\n",
    "    start_char_idx = chunk_info[\"start_char_idx\"]\n",
    "    end_char_idx = chunk_info[\"end_char_idx\"]\n",
    "\n",
    "    node = chunk_info[\"node\"]\n",
    "    node_start_char_idx = node.start_char_idx\n",
    "    node_end_char_idx = node.end_char_idx\n",
    "\n",
    "    # using the node start and end char idx, we can offset the\n",
    "    # citation chunk to locate the citation in the\n",
    "    document_start_char_idx = start_char_idx + node_start_char_idx\n",
    "    document_end_char_idx = document_start_char_idx + (end_char_idx - start_char_idx)\n",
    "    text = documents[0].text[document_start_char_idx:document_end_char_idx]\n",
    "\n",
    "    print(text)\n",
    "    print(node.metadata)\n",
    "    print(\"----------------\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Try a random question\n",
    "\n",
    "If we ask a question unrelated to the data in the index, we should not have any matching citaitons (in most cases)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "0\n"
     ]
    }
   ],
   "source": [
    "response = fuzzy_engine_pack.run(\"Where is San Francisco located?\")\n",
    "\n",
    "print(len(response.metadata.keys()))"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.10.8"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}
